Fix race condition causing 'ack timeout 4' disconnects #4

Merged
merged 1 commit on Apr 11, 2021

Conversation

@mmakaay commented Apr 11, 2021

A pull request, as requested by @OttoWinter in the esphome/issues#1921 (comment)

The AsyncClient::send() method sets a boolean to true after pushing
data over the TCP socket successfully using tcp_output(). It also sets
a timestamp to remember at what time the data was sent.

The AsyncClient::_sent() callback method reacts to ACKs coming from
the connected client. This method sets the boolean to false.

In the AsyncClient::_poll() method, a check is done to see if the
boolean is true ("I'm waiting for an ACK") and if the time at which
the data was sent is too long ago (5000 ms). If this is the case,
a connection issue with the connected client is assumed and the
connection is forcibly closed by the server.

The race condition occurs when these operations get mixed up because of
multithreading behavior. The _sent() method can be called during
the execution of the send() method:

  1. send() sends out data using tcp_output()
  2. _sent() is called because an ACK is processed, sets boolean to false
  3. send() continues and sets boolean to true + timestamp to "now"

After this, the data exchange with the client was successful. Data were
sent and the ACK was seen.
However, the boolean ended up as true, making the _poll() method think
that an ACK is still to be expected. As a result, 5000 ms later, the
connection is dropped.

This commit fixes the code by first registering that an ACK is
expected, before calling tcp_output(). This way, there is no race
condition when the ACK is processed right after that call.

Additionally, I changed the boolean to an integer counter value.
The server might send multiple messages to the client, resulting in
multiple expected ACKs. A boolean does not cover this situation.
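For illustration, a minimal sketch of the reordering described above (the counter name `_tx_unacked` is made up here; it stands in for the original `_pcb_busy` flag and the new counter, whose real name may differ):

bool AsyncClient::send() {
    // Record that an ACK is expected *before* handing the data to lwIP, so a
    // _sent() callback that fires immediately cannot be overwritten afterwards.
    _tx_unacked++;                 // illustrative counter, replaces the old boolean
    _pcb_sent_at = millis();
    if (_tcp_output(_pcb, _closed_slot) == ERR_OK) {
        return true;
    }
    // Sending failed, so no ACK will arrive for this attempt after all.
    _tx_unacked--;
    return false;
}

int8_t AsyncClient::_sent(tcp_pcb* pcb, uint16_t len) {
    if (_tx_unacked > 0) {
        _tx_unacked--;             // simplified: real ACKs may be coalesced by lwIP
    }
    _rx_last_packet = millis();
    // ... rest of the original callback
    return ERR_OK;
}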

@OttoWinter (Owner) left a comment

Thanks! 🎉

@Misiu commented Apr 15, 2021

@mmakaay what about ESPAsyncTCP-esphome?
AsyncTCP is used on ESP32, but ESPAsyncTCP is used on ESP8266 (ref: https://github.com/esphome/esphome/pull/1693/files#diff-1988fae5ec01f4c733d17749246f0facf70a29738259590d1bdf9225ad10cd3dR10-R15), which many of us still use.
I'm not sure if this error occurs there (I don't have the knowledge to tell), maybe not, but I thought it was worth asking :)

@mmakaay (Author) commented Apr 15, 2021

I have looked at ESPAsyncTCP while I was tracking this issue, and it does use the same construction to keep track of ACKs.
I concluded that it might therefore suffer from the same kind of disconnection issues (does it?).

However, a lot has been done to that code, and I saw some very specific comments about possible crashes and how to prevent them. The two libraries looked so different in places that I could not decide right away whether the same fix was needed.
If this type of disconnect error is occurring for ESP8266 too, then I'm pretty sure that the same fix can be applied.

Somewhere in the back of my mind, I made a mental note about checking out the actual differences between these two libraries, because when looking at them, I wondered if the ESP32 AsyncTCP library wouldn't benefit from some of the developments in the ESPAsyncTCP library.

@mmakaay (Author) commented Apr 22, 2021

Hey @OttoWinter, based on the revert of the AsyncTCP version bump by @glmnet: do you agree it would be best to revert the patch from this pull request? It's no good if there are devices that show a regression in the form of extra disconnects.

@OttoWinter (Owner) commented

@mmakaay Agree, though it would make sense to understand how that regression could be caused too. Maybe we're encountering another race condition now because some code is executed on another ESP32 core in parallel (or interleaved execution), so it might make sense to use mutexes in this code.
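For illustration, serializing access to the ACK bookkeeping on the ESP32 could look roughly like this (a sketch using a FreeRTOS critical section rather than a full mutex; the spinlock name and locking granularity are made up, not the library's actual code):

#include "freertos/FreeRTOS.h"

// Hypothetical spinlock guarding _pcb_busy / _pcb_sent_at against the lwIP
// task updating them from the other core (or with interleaved execution).
static portMUX_TYPE ack_state_mux = portMUX_INITIALIZER_UNLOCKED;

bool AsyncClient::send() {
    portENTER_CRITICAL(&ack_state_mux);
    _pcb_busy = true;
    _pcb_sent_at = millis();
    portEXIT_CRITICAL(&ack_state_mux);
    if (_tcp_output(_pcb, _closed_slot) == ERR_OK) {
        return true;
    }
    portENTER_CRITICAL(&ack_state_mux);
    _pcb_busy = false;
    portEXIT_CRITICAL(&ack_state_mux);
    return false;
}

int8_t AsyncClient::_sent(tcp_pcb* pcb, uint16_t len) {
    portENTER_CRITICAL(&ack_state_mux);
    _pcb_busy = false;
    portEXIT_CRITICAL(&ack_state_mux);
    // ... rest of the original callback
    return ERR_OK;
}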

@mmakaay (Author) commented May 3, 2021

Yes, I think that mutexes might improve this.
I have tried the code on all of my devices and for me it improved their API connections, so I wasn't able to trigger the regression myself. What would be interesting to know is what exact messages are logged on the console at the time of the disconnects, since that would tell us whether the disconnects were related to the ACK handling.

What I didn't look into further so far is how this setup would work when actual ACKs go missing. Where are the retransmits for this arranged? Should lwIP handle these? The AsyncTCP ACK handling assumes a sunny-day scenario in which no ACKs get lost, and treats a lost ACK as a reason to kick the connection. That is not the best scenario to work with. Possibly the regression hits when actual packets are getting lost on the network? It shouldn't kill the connection in such a case though, since the whole reason for TCP is to allow for such issues. From a TCP perspective, the connection might be considered sub-optimal, but it is very much usable nonetheless.

I can dive a bit deeper into the stack and see what lwIP actually does. If that layer does handle ACKs and retransmits, I'm not so sure that ACK handling even ought to be in AsyncTCP to begin with.

@OttoWinter (Owner) commented

Yes, lwIP does handle retransmits :)

IIRC the reason for it is cleanup purposes: with the ACK logic you know when it's safe to write to a buffer again (if the buffer wasn't copied to lwIP's internal one before). But I'm not 100% sure about that anymore.

@glmnet commented May 5, 2021

I managed to reproduce this issue by introducing a delay in the send method:

bool AsyncClient::send()
{
    int8_t err = ERR_OK;
    err = _tcp_output(_pcb, _closed_slot);
    if (err == ERR_OK)
    {
        vTaskDelay(50 / portTICK_RATE_MS);
        // The 50 is to give the network plenty of time to send the ack,
        // I tried with 25 and there were no timeouts
        _pcb_busy = true;
        _pcb_sent_at = millis();
        return true;
    }
    return false;
}

And fixed it with this (based on @mmakaay's ideas):

bool AsyncClient::send()
{
    int8_t err = ERR_OK;
    auto pcb_sent_at_back = _pcb_sent_at;
    auto pcb_busy_back = _pcb_busy;
    _pcb_sent_at = millis();
    _pcb_busy = true;
    err = _tcp_output(_pcb, _closed_slot);
    if (err == ERR_OK)
    {
        return true;
    }
    _pcb_busy = pcb_busy_back;
    _pcb_sent_at = pcb_sent_at_back;
    return false;
}

I also marked _pcb_busy and _pcb_sent_at as volatile:

  volatile bool _pcb_busy;
  volatile uint32_t _pcb_sent_at;

_pcb_busy is there just to tell us that _pcb_sent_at has a valid value and that the ACK timeout should be observed, so there is no harm at all; both values are backed up in case tcp_output fails.
If a timeout is already pending and tcp_output fails, the timeout will be "delayed", but the values will be reverted, so the timeout will still happen even if send is called again quickly.

@mmakaay (Author) commented May 5, 2021

Was making them volatile required to make it work correctly?

Last night, I came up with a slightly different and possibly cleaner fix. I will explain my code below.

Before I explain the code, here is a consideration that I included in this change:
One thing I found weird is that the ACK check presumably checks for ACKs, but it uses _rx_last_packet for that check. That field is updated on both incoming data packets and incoming ACKs. Thus, we might still be waiting for an ACK, but an incoming data packet would clear the timeout.
Likewise, the RX timeout would be cleared by an incoming ACK; however, that might be fine (when the check implements the use case "check if we see any traffic coming in from the client").

On to the code changes:

  • Removed _pcb_busy, because it is not needed for the timeout logic.
  • Renamed _pcb_sent_at to _tx_last_packet. That naming is more in line with the existing _rx_last_packet.
  • Introduced a field _rx_last_ack, to let the ACK timeout check look at ACKs instead of at "any incoming packet", as per my consideration above.

This makes the block of related fields from AsyncTCP.h:

    bool _ack_pcb;
    uint32_t _tx_last_packet;
    uint32_t _rx_ack_len;
    uint32_t _rx_last_packet;
    uint32_t _rx_since_timeout;
    uint32_t _rx_last_ack;
    uint32_t _ack_timeout;
    uint16_t _connect_port;

A few side-remarks:

  • for fun, check the use of _ack_pcb. It is always true, and there's one conditional that will always fail because of that. This looks like code that needs to be cleaned up.
  • _rx_since_timeout could be renamed to _rx_timeout. The "since" bit is confusing to me.

I removed all uses of _pcb_busy.
Next to that, the code from AsyncTCP.cpp that I changed is:

A much simplified send method, which now registers the time of the last transmitted packet when a send succeeds:

bool AsyncClient::send(){
    auto backup = _tx_last_packet;
    _tx_last_packet = millis();
    if (_tcp_output(_pcb, _closed_slot) == ERR_OK) {
        return true;
    }
    _tx_last_packet = backup;
    return false;
}

The method that handles an incoming ACK now updates _rx_last_ack:

int8_t AsyncClient::_sent(tcp_pcb* pcb, uint16_t len) {
    _rx_last_packet = millis();
    _rx_last_ack = millis();
    // ... snip
}

The logic for the ACK timeouts now checks whether the last incoming ACK time is before the last sent packet time. That eliminates the need for a boolean to enable the ACK timeout check.

int8_t AsyncClient::_poll(tcp_pcb* pcb){
    // ... snip
    // ACK Timeout
    if(_ack_timeout && _tx_last_packet > _rx_last_ack && (now - _tx_last_packet) >= _ack_timeout){
        log_w("ack timeout %d", pcb->state);
        if(_timeout_cb)
            _timeout_cb(_timeout_cb_arg, this, (now - _tx_last_packet));
        return ERR_OK;
    }
    // ... snip
}

This code is giving me very stable connections. I have 8 OTA loggers and HA active, and they now run as stably for me as when I fully disable the ACK timeout checks.

Do you want me to bundle this up as a new pull request?

@glmnet commented May 5, 2021

The volatile was added just by instinct.

Regarding your code, you should check your timestamp comparison. Otherwise it looks good to me.

_tx_last_packet > _rx_last_ack

should check for rollovers, see https://arduino.stackexchange.com/questions/12587/how-can-i-handle-the-millis-rollover
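
For illustration, the unsigned-subtraction idiom from that answer could replace the direct comparison like this (a sketch against the field names above, not the code that was eventually merged):

// Unsigned subtraction gives the correct ordering/elapsed time across a
// millis() rollover, as long as the two events are less than ~24.8 days apart.
bool last_tx_is_after_last_ack =
    (uint32_t)(_tx_last_packet - _rx_last_ack) < 0x80000000UL;  // half the uint32_t range
if (_ack_timeout && last_tx_is_after_last_ack && (now - _tx_last_packet) >= _ack_timeout) {
    log_w("ack timeout %d", pcb->state);
    // ... handle the timeout as before
}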

@mmakaay (Author) commented May 5, 2021

Good point. I was wondering if it would be worth it, but let's prevent that possible once-every-50-days disconnect issue 👍
How about this?

    if(_ack_timeout){
        uint32_t one_day = 86400000;
        bool last_tx_is_after_last_ack = (_rx_last_ack - _tx_last_packet + one_day) < one_day;
        if(last_tx_is_after_last_ack && (now - _tx_last_packet) >= _ack_timeout) {
            log_w("ack timeout %d", pcb->state);
            if(_timeout_cb)
                _timeout_cb(_timeout_cb_arg, this, (now - _tx_last_packet));
            return ERR_OK;
        }
    }

@mmakaay (Author) commented May 5, 2021

Okay, a follow-up pull request was created at #5.

I did not include the volatile qualifiers, because I do not think they would be useful.
Reading up on the combination of volatile and threading behavior in this Stack Overflow post didn't convince me that they would be.
If you do want to have those variables volatile anyway, let me know in the new PR and I can update it.

BlueAndi pushed a commit to BlueAndi/AsyncTCP that referenced this pull request Jun 10, 2021
After my first attempt at fixing the client disconnects
(OttoWinter#4) got merged
into AsyncTCP, it turned out that there was a regression
for some devices: the connection stability actually went
down instead of up.

After a lot of debugging and discussion with @glmnet (some of
the results can be found in the above pull request discussion),
we came up with an improved fix for the disconnect issues.

**Changed:**

The code that checks for ACK timeouts has been simplified in
such a way that only two timestamps are now used to determine if
an ACK timeout has happened: the time of the last sent packet
(this was already recorded), and the time of the last received
ACK from the client (this has been added).

Using these timestamps, there is no more need for a separate field
to keep track if we are waiting for an ACK or not (`_pcb_busy`).
Therefore, this field was completely removed from the code.

While I was at it, I renamed a few variables to make the code
easier to read and more consistent.

**Results:**

I connected Home Assistant plus 8 OTA loggers at the same time,
using very verbose logging output. This normally was an easy way
to trigger the disconnect errors.

It turned out that this solution runs as solidly for me as when disabling
the ACK timeout checks completely
(using `AsyncClient::setAckTimeout(0)`).

(cherry picked from commit 6f37967)
HeMan pushed a commit to HeMan/AsyncTCP that referenced this pull request Dec 25, 2023